1 Abstract

In this paper, we attempt to learn a single metric across two heterogeneous
domains, where the source domain is fully labeled and has many samples while
the target domain has only a few labeled samples but abundant unlabeled
samples. To the best of our knowledge, this task is seldom touched. The
proposed learning model has a simple underlying motivation: all the samples in
both the source and the target domains are mapped into a common space, where
both their priors P(sample) and their posteriors P(label|sample) are forced to
be respectively aligned as much as possible. We show that the two mappings,
from the source domain and the target domain to the common space, can be
reparameterized into a single positive semi-definite (PSD) matrix. We then
develop an efficient Bregman projection algorithm to optimize the PSD matrix,
over which a LogDet function is used as the regularizer. Furthermore, we show
that this model can be easily kernelized, and we verify its effectiveness on a
cross-language retrieval task and a cross-domain object recognition task.
2 Introduction



Metric learning lies at the heart of many machine learning tasks such as
clustering and recognition, and has thus been extensively studied.
However, most of the works only focus on learning metric for a single domain 1 151 
|2"HI l2"rJl [231 [T51 127] , leaving metric learning across multiple-domains seldom 
touched. In this paper, we introduce the Metric Learning across Heterogeneous
Domains (MLHD) model to learn a single metric across two heterogeneous domains,
meaning that not only their sample distributions but also their feature spaces
differ. Of the two domains, the source domain, which has been collected
beforehand, is fully labeled and has many samples, while the target domain has
only a few labeled samples but abundant unlabeled samples, because collecting
labels is expensive and tedious. Since the samples in the two domains may
differ in feature dimension, distances between them cannot be calculated
directly. A simple and direct idea, as depicted in Figure 1, is to linearly map
the samples of both domains into a common space where their distances can be
calculated. At the same time, the two mappings, from the source domain and the
target domain to the common space, should satisfy some constraints. First,
samples sharing the same label in the two domains should be close to each other
in the common space. In other words, the posteriors P(label|sample) of the two
domains should be aligned as closely as possible in the common space. As shown
in Figure 1(a), the red circles and squares are close to each other, and so are
the blue circles and squares. However, this is not enough, because the target
domain has only a few labeled samples. If the posteriors are aligned based only
on such a small portion of labeled target samples, the alignment is likely to
be biased and to generalize poorly. Figure 1(a) shows that the gray unlabeled
samples are aligned poorly even though the colored labeled samples are aligned
well. To alleviate this problem, we also force the priors P(sample) of the two
domains to be aligned as much as possible. Since the target domain usually
contains many unlabeled samples, the estimate of its prior is relatively more
reliable, and aligning the priors can correct, to some extent, the bias
introduced by the poorly estimated posterior of the target domain. Figure 1(b)
shows that both the unlabeled and the labeled samples are well located in the
common space when the priors and the posteriors are respectively aligned.
(a) Aligning posteriors (b) Aligning priors and posteriors



Figure 1: Mapping the samples of the source and the target domains to a 
common space. Circles and squares are the samples in the target and the source 
domains respectively. Colors indicate the labels of the samples. Gray represents 
the unlabeled samples. 

One advantage of our model that deserves to be highlighted is that the learned
metric is defined in the common space, so distances can be computed directly
between a source sample and a target sample. This is convenient in practice:
for example, the metric can directly serve retrieval tasks across heterogeneous
domains, such as cross-language retrieval and cross-domain object recognition.
Although there are a few works on metric learning over heterogeneous domains,
some of them focus on learning a metric only in the target domain [15], and few
focus on learning a metric (or similarity) across heterogeneous domains. Saenko
et al. [19] proposed to learn a metric across different image domains; however,
their model is limited to the case where the source and the target domains have
the same dimensionality. Kulis et al. [12] removed this limitation; however,
their model learns a similarity function that only computes cross-domain
similarity. Moreover, from this paper's perspective both of these works only
align the posteriors, and thus do not exploit the unlabeled samples of the
target domain, which can be very useful.

The formulation of our MLHD model consists of three parts. The first two parts
enforce the alignment of the priors and of the posteriors between the two
domains. The priors are aligned by minimizing a two-sample test statistic, the
Maximum Mean Discrepancy (MMD) proposed by Gretton et al. [9]. The posteriors
are aligned by keeping samples of the same class close together and samples of
different classes far apart. We show that these two parts can be
reparameterized with a single PSD matrix. We then introduce a LogDet
regularizer over the PSD matrix as the third part, which avoids handling the
troublesome PSD constraint explicitly: the constraint is satisfied
automatically [7] [13] when the Bregman projection algorithm is used with a
LogDet objective. In addition, to deal with the nonlinear case, we kernelize
our model based on the kernelization work on the LogDet function [11] [TU], and
provide detailed proofs of the kernelization.
The rest of the paper is organized as follows: Section 3 reviews related work,
Section 4 presents the MLHD model in detail, Section 5 verifies the
effectiveness of MLHD experimentally, and Section 6 concludes the paper.

3 Related Work 

Learning from heterogeneous data sources has attracted much attention. However,
most existing work focuses on classification or dimensionality reduction in the
target domain with the help of the source domains; only a few works address
metric learning across heterogeneous domains. Thus, in this section, in
addition to related work on metric learning across heterogeneous domains, we
first review some recent work on heterogeneous classification.

Heterogeneous learning may date back to Dai et al.JS]. They used some 
co-occurrence data to estimate the feature-level conditional distribution from
source features to target features. Later, many other methods were proposed
[25] [23] [24] [20] [5]. A common characteristic of these methods is that they
all map the samples of the source and the target domains into a common space
for the learning task. For example, Wang et al. [24] embedded the samples of
the different domains into a common space according to a large manifold
structure covering both the within-domain geometrical structure and the
between-domain label structure. Zhang et al.[2j|] mapped all the samples into a
common space and applied classic linear discriminant analysis (LDA). Shi et al.
[20] used a collective matrix factorization model to find the common space.
However, their algorithm requires the source and target domains to have the
same number of samples, which usually cannot be satisfied; thus a sampling
procedure has to be applied before running the algorithm. Duan et al. [5] constructed
a parameterized augmented space as the common space motivated by a domain 
adaptation method proposed by Daume et al. did[B]. And the parameters are 
learned through optimizing a large margin classification model. 

Although learning from heterogeneous data sources has attracted much attention,
work on metric learning across heterogeneous domains is relatively rare. Qi et
al. [15] focused on metric learning only for the target domain rather than
across the source and the target domains, and thus address a different setting
from ours. To the best of our knowledge, the work of Kulis et al. [12] is the
closest to ours, although what they learn is, strictly speaking, a similarity
function rather than a metric across the source and the target domains. They
proposed a Frobenius-norm regularized large-margin model to learn a (linear)
similarity function, which, from this paper's perspective, can be seen as
aligning only the posteriors and not the priors. Thus, they do not exploit the
abundant unlabeled samples available in the target domain.
4 Metric Learning across Heterogeneous Domains



In this section, we first introduce some notation and then give the
mathematical model. Next, we optimize the model with the Bregman projection
method. Finally, we show how to kernelize the model.



4.1 Problem Statement and Notations 



We first introduce the notation used throughout this paper. Assume that we are
given two domains: a $D_x$-dimensional labeled source domain
$\mathcal{X} = \{(x_i, l_i^x) \mid i = 1, 2, \dots, N_x\}$, and a
$D_y$-dimensional partially labeled target domain
$\mathcal{Y} = \{(y_i, l_i^y) \mid i = 1, 2, \dots, N_y^l\} \cup \{y_i \mid i = N_y^l + 1, \dots, N_y^l + N_y^u\}$.
Let $N_y = N_y^l + N_y^u$. For convenience, we also define two data matrices
$X = [x_1, x_2, \dots, x_{N_x}] \in \mathbb{R}^{D_x \times N_x}$ and
$Y = [y_1, y_2, \dots, y_{N_y}] \in \mathbb{R}^{D_y \times N_y}$. Two linear
operators $W_x \in \mathbb{R}^{D_x \times D_c}$ and
$W_y \in \mathbb{R}^{D_y \times D_c}$ are used to map the samples of the two
domains into a $D_c$-dimensional common space, namely $x \to W_x^T x$ and
$y \to W_y^T y$. The metric is defined as the 2-norm distance
$d(x, y) = \|W_x^T x - W_y^T y\|_2$. Furthermore, the squared metric can be
rewritten in matrix form as follows:

$$d^2(x_i, y_j) = \begin{bmatrix} x_i \\ -y_j \end{bmatrix}^T \begin{bmatrix} W_x W_x^T & W_x W_y^T \\ W_y W_x^T & W_y W_y^T \end{bmatrix} \begin{bmatrix} x_i \\ -y_j \end{bmatrix}$$

If we let

$$M = \begin{bmatrix} W_x W_x^T & W_x W_y^T \\ W_y W_x^T & W_y W_y^T \end{bmatrix}$$

and $z_{ij} = [x_i^T \;\; -y_j^T]^T$, then we have

$$d_M^2(x_i, y_j) = z_{ij}^T M z_{ij} \qquad (1)$$

which is parameterized only by the matrix $M \in S_+$, where $S_+$ denotes the
set of all symmetric positive semi-definite matrices.

The goal of metric learning across heterogeneous domains is to learn the
parameterized metric $d_M$ defined above using the data in both the source
domain $\mathcal{X}$ and the target domain $\mathcal{Y}$.
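
The reparameterization in equation (1) can be checked numerically. The following is a minimal NumPy sketch, not part of the original paper; the dimensions and the random mappings `W_x`, `W_y` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_x, D_y, D_c = 4, 6, 3              # illustrative dimensions (assumed)

W_x = rng.standard_normal((D_x, D_c))
W_y = rng.standard_normal((D_y, D_c))
x = rng.standard_normal(D_x)         # one source sample
y = rng.standard_normal(D_y)         # one target sample

# Stack the two mappings; M = W W^T is the block matrix of equation (1).
W = np.vstack([W_x, W_y])            # shape (D_x + D_y, D_c)
M = W @ W.T                          # symmetric PSD by construction
z = np.concatenate([x, -y])          # z = [x^T, -y^T]^T

d2_direct = np.sum((W_x.T @ x - W_y.T @ y) ** 2)   # ||W_x^T x - W_y^T y||^2
d2_via_M = z @ M @ z                               # z^T M z

print(d2_direct, d2_via_M)
```

The two printed values agree up to floating-point error, which is why the model can be stated over the single matrix $M$ instead of the pair $(W_x, W_y)$.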



4.2 Formulation 

In this subsection, we propose our Metric Learning across Heterogeneous Domains
(MLHD) model, which fully exploits both the labeled and the unlabeled samples
in the two domains. To reach this goal, we force the model to align not only
the posteriors but also the priors of the two domains.

Aligning the posteriors amounts to forcing samples of the same class to be
close together and samples of different classes to be far apart. This is easily
achieved by imposing the following distance constraints:

$$d^2(x_i, y_j) \ge l \quad \text{if } l_i^x \ne l_j^y \qquad (2)$$
$$d^2(x_i, y_j) \le u \quad \text{if } l_i^x = l_j^y \qquad (3)$$

Aligning the priors is closely related to the statistical two-sample testing
problem, which determines whether two random variables have the same
distribution. We therefore briefly introduce the method proposed by Gretton et
al. [9] for two-sample testing. In that paper, the authors used a kernel method
to measure the discrepancy between two random variables. The proposed
statistic, named Maximum Mean Discrepancy (MMD), is the distance between the
means of the two random variables after mapping into a Reproducing Kernel
Hilbert Space. The authors also presented several important statistical
results. The first is that if the kernel is universal [17], then MMD = 0 if and
only if the two random variables have the same distribution. The authors also
showed that the empirical MMD converges in probability at rate $1/\sqrt{n}$,
where $n$ is the total number of samples. In this paper, we align the priors of
the two domains in the common space by minimizing the squared MMD statistic on
the samples mapped into the common space. The formulation is as follows

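$$\mathrm{MMD}^2(\mathcal{X}, \mathcal{Y}) = \left\| \frac{1}{N_x} \sum_{i=1}^{N_x} \phi(W_x^T x_i) - \frac{1}{N_y} \sum_{j=1}^{N_y} \phi(W_y^T y_j) \right\|_{\mathcal{H}}^2 \qquad (4)$$

where $\phi$ is the feature map induced by $k(\cdot, \cdot)$, a universal
kernel function, for example the Gaussian kernel. Unfortunately, minimizing
equation (4) with respect to $W_x$ and $W_y$ is difficult because 1) it is
nonconvex and 2) $W_x$ and $W_y$ are embedded in the nonlinear kernel. To make
the problem tractable, we instead use the linear kernel, which makes equation
(4) convex, and reparameterize it using the matrix $M \in S_+$. To this end, let

$$z = \left[ \frac{1}{N_x} \sum_{i=1}^{N_x} x_i^T \quad -\frac{1}{N_y} \sum_{j=1}^{N_y} y_j^T \right]^T \qquad (5)$$

The squared MMD can now be simplified to:

$$\mathrm{MMD}_M^2(\mathcal{X}, \mathcal{Y}) = z^T M z \qquad (6)$$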
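
As an illustration of equation (6), the sketch below compares $z^T M z$ with the pairwise (biased) linear-kernel MMD estimate computed on the mapped samples. It is a hedged example rather than the authors' code: the function name `linear_mmd2`, the data sizes, and the random mappings are assumptions made for the check.

```python
import numpy as np

def linear_mmd2(X, Y, M):
    """Squared MMD with a linear kernel, written as z^T M z.

    X has shape (D_x, N_x), Y has shape (D_y, N_y), columns are samples.
    """
    z = np.concatenate([X.mean(axis=1), -Y.mean(axis=1)])
    return float(z @ M @ z)

rng = np.random.default_rng(1)
D_x, D_y, D_c, N_x, N_y = 4, 6, 3, 20, 30      # illustrative sizes (assumed)
X = rng.standard_normal((D_x, N_x))
Y = rng.standard_normal((D_y, N_y)) + 0.5
W_x = rng.standard_normal((D_x, D_c))
W_y = rng.standard_normal((D_y, D_c))
W = np.vstack([W_x, W_y])
M = W @ W.T

# Pairwise form: mean of k(a, a') - 2 * mean of k(a, b) + mean of k(b, b')
# with the linear kernel k(u, v) = u^T v on the mapped samples.
A = W_x.T @ X                                   # mapped source, (D_c, N_x)
B = W_y.T @ Y                                   # mapped target, (D_c, N_y)
mmd2_pairwise = (A.T @ A).mean() - 2 * (A.T @ B).mean() + (B.T @ B).mean()
mmd2_linear = linear_mmd2(X, Y, M)

print(mmd2_pairwise, mmd2_linear)
```

With a linear kernel the statistic only compares the two sample means in the common space, which is the limitation the toy experiment later points out.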

So far, the alignment of both the priors and the posteriors depends only on the
PSD matrix $M$; however, the PSD constraint is relatively troublesome for
optimization. Fortunately, a LogDet-regularized model automatically preserves
the PSD property of $M$ during optimization while remaining convex [14].
Consequently, we use the LogDet function to regularize the matrix $M$ as
follows:

$$\mathrm{LogDet}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - \dim(M) \qquad (7)$$

where $\mathrm{tr}(\cdot)$ is the trace operator and $\dim(\cdot)$ is the
dimension function.
Now, by incorporating the LogDet regularizer, we formulate our MLHD model as
follows

$$\begin{aligned}
\min_{M, \xi} \quad & \mathrm{LogDet}(M, I) + \lambda_1 \mathrm{MMD}_M^2(\mathcal{X}, \mathcal{Y}) + \lambda_2 \mathrm{LogDet}(\xi, \xi_0) \\
\text{s.t.} \quad & d_M^2(x_i, y_j) \ge \xi_{ij} \quad \text{if } l_i^x \ne l_j^y \\
& d_M^2(x_i, y_j) \le \xi_{ij} \quad \text{if } l_i^x = l_j^y \\
& M \in S_+
\end{aligned} \qquad (8)$$

where $\xi$ is a vector of slack variables, and $\xi_0$ is an initialization
vector whose components equal $u$ for same-class constraints and $l$ for
different-class constraints. $I$ is the identity matrix, and $\lambda_1$ and
$\lambda_2$ are two trade-off parameters. This is still a convex model.
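
For reference, the LogDet function of equation (7), which appears in the objective above, can be evaluated as in the following sketch. This is an illustrative helper, not the authors' implementation; the name `logdet_div` is an assumption, and both arguments are assumed to be symmetric positive definite.

```python
import numpy as np

def logdet_div(M, M0):
    """LogDet(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - dim(M).

    Both M and M0 are assumed symmetric positive definite.
    """
    # M0^{-1} M has the same trace and determinant as M M0^{-1}.
    P = np.linalg.solve(M0, M)
    sign, logabsdet = np.linalg.slogdet(P)
    if sign <= 0:
        raise ValueError("M M0^{-1} must have a positive determinant")
    return float(np.trace(P) - logabsdet - M.shape[0])

I2 = np.eye(2)
print(logdet_div(I2, I2))        # zero when M equals M0
print(logdet_div(2 * I2, I2))    # 4 - 2*log(2) - 2
```

The value is zero when the two matrices coincide and grows without bound as $M$ approaches a singular matrix, which is what keeps the iterates inside the PSD cone.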



4.3 Optimization 

In this section, we use Bregman Projection algorithm]!! [3] to optimize our 
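model. The algorithm cyclically projects the current solution onto a single
constraint under a Bregman divergence, here the LogDet function. To facilitate
the projection, we first relax equation (8) so that its objective contains only
LogDet functions, as follows:

$$\begin{aligned}
\min_{M, t, \xi} \quad & \mathrm{LogDet}(M, I) + \lambda_1 \mathrm{LogDet}(t, t_0) + \lambda_2 \mathrm{LogDet}(\xi, \xi_0) \\
\text{s.t.} \quad & d_M^2(x_i, y_j) \ge \xi_{ij} \quad \text{if } l_i^x \ne l_j^y \\
& d_M^2(x_i, y_j) \le \xi_{ij} \quad \text{if } l_i^x = l_j^y \\
& \mathrm{MMD}_M^2(\mathcal{X}, \mathcal{Y}) \le t \\
& M \in S_+
\end{aligned} \qquad (9)$$

where $t_0$ is a small positive number used to initialize $t$. Note that
$t \ge 0$ is implied by the constraint $\mathrm{MMD}_M^2(\mathcal{X}, \mathcal{Y}) \le t$,
so $t$ can be placed inside the LogDet function.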

We then present the optimization method, described in Algorithm 1. The
algorithm cyclically projects the current solution onto a single linear
constraint under the LogDet function, so each projection can be solved
analytically. Because the LogDet function is defined only over the set of PSD
matrices, the projected result remains in $S_+$. Similar methods are also used
in [13].
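
To make the projection step concrete, here is a deliberately simplified sketch: a single LogDet Bregman projection of $M$ onto one equality constraint $z^T M z = c$ with a fixed target $c$. It omits the slack variables, the dual-variable clipping, and the MMD variable $t$ that Algorithm 1 maintains, so it is not the paper's full update; the function name `logdet_project` and the test values are assumptions.

```python
import numpy as np

def logdet_project(M, z, target):
    """LogDet Bregman projection of M onto {M : z^T M z = target}.

    The projection is a rank-one update M + beta * (M z)(M z)^T, where
    beta is chosen so that the constraint holds exactly. M is assumed
    symmetric positive definite and target > 0.
    """
    Mz = M @ z
    p = float(z @ Mz)                 # current value of z^T M z
    beta = (target - p) / (p * p)
    return M + beta * np.outer(Mz, Mz)

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
M = A @ A.T + np.eye(5)               # a positive definite starting point
z = rng.standard_normal(5)

M_new = logdet_project(M, z, target=0.5)
print(z @ M_new @ z)                  # equals the target up to rounding
print(np.linalg.eigvalsh(M_new).min() > 0)
```

Because the update is rank-one and the target is positive, the result stays positive definite without any explicit eigenvalue clipping, which is the property the text relies on.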



4.4 Kernelization 

The MLHD model established in equation (8) is linear, and is thus inappropriate
for nonlinear settings. As a widely accepted solution, the kernel method, by
nonlinearly mapping the original samples into a high-dimensional space and
learning in that space, can conveniently convert a linear model into a
nonlinear one [21]. In this subsection, we show how to kernelize the above
linear model. We first introduce some notation. Denote $Q$ by

$$Q = \begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} \in \mathbb{R}^{(D_x + D_y) \times (N_x + N_y)} \qquad (10)$$
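Algorithm 1 Optimization algorithm for MLHD model

Input: source domain $\mathcal{X}$, target domain $\mathcal{Y}$, parameters $\lambda_1, \lambda_2$

Initialize: primal variables $M = I$, $\xi = \xi_0$, $t = t_0$; dual variables $\beta = 0$, $\zeta = 0$

while $n <$ MaxIter do

-- Bregman projection on distance constraints --
for each distance constraint do

1: Solve the following problem by the Lagrangian method, and obtain its
Lagrangian multiplier $\alpha$

$$\min_{M, \xi_{ij}} \; \mathrm{LogDet}(M, M^n) + \lambda_2 \mathrm{LogDet}(\xi_{ij}, \xi_{ij}^n) \quad \text{s.t. } z_{ij}^T M z_{ij} = \xi_{ij}$$

2: Update the primal variables $M$, $\xi_{ij}$ and the dual variable $\beta_{ij}$.
If $l_i^x = l_j^y$, then $\delta = 1$, else $\delta = -1$.

$$p = z_{ij}^T M^n z_{ij}$$
$$\alpha = \min\left(\beta_{ij}, \; \frac{\delta \lambda_2}{1 + \lambda_2} \left(\frac{1}{p} - \frac{1}{\xi_{ij}}\right)\right)$$
$$\beta_{ij} = \beta_{ij} - \alpha$$
$$\xi_{ij} = \lambda_2 \xi_{ij} / (\lambda_2 + \delta \alpha \xi_{ij})$$
$$M = M^n + \frac{\delta \alpha}{1 - \delta \alpha p} M^n z_{ij} z_{ij}^T M^n$$

end for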

-- Bregman projection on MMD constraint --
1: Solve the following problem by the Lagrangian method, and obtain its
Lagrangian multiplier $\eta$

$$\min_{M, t} \; \mathrm{LogDet}(M, M^n) + \lambda_1 \mathrm{LogDet}(t, t^n) \quad \text{s.t. } z^T M z = t$$

2: Update the primal variables $M$, $t$ and the dual variable $\zeta$

r) = —min(C,, —rj) 
C = ( + niin((,-T]) 

M = M n - -= — M n zz T M r 

z T M n z - i] 



t n X 



t n i] + Ai 



end while 



Denote the kernel functions defined on $\mathcal{X}$ and $\mathcal{Y}$ by
$k_x(\cdot, \cdot)$ and $k_y(\cdot, \cdot)$ respectively. Let $K_x$ and $K_y$
be the kernel matrices whose $(i, j)$th entries are $k_x(x_i, x_j)$ and
$k_y(y_i, y_j)$ respectively, and let the kernel matrix $K$ be

$$K = \begin{bmatrix} K_x & 0 \\ 0 & K_y \end{bmatrix} \in \mathbb{R}^{(N_x + N_y) \times (N_x + N_y)} \qquad (11)$$

Let $e_{ij} = [e_i^T \;\; -e_j^T]^T$, where $e_i$ is a vector with only the
$i$th entry being 1. Then the squared metric in equation (1) can be written as
$d^2(x_i, y_j) = e_{ij}^T Q^T M Q e_{ij}$. Let
$e = [1_{N_x}^T / N_x \;\; -1_{N_y}^T / N_y]^T$, where $1_N$ is the
$N$-dimensional vector with all entries equal to 1. Then the squared MMD in
equation (6) can be rewritten as
$\mathrm{MMD}^2(\mathcal{X}, \mathcal{Y}) = e^T Q^T M Q e$.

We follow the idea in [13] to kernelize our MLHD model. Specifically, we first
show that the range space of the matrix parameter $M$ in equation (8) lies in
the range space of $Q$, and then derive an equivalent optimization problem that
depends only on the inner products defined in the source and the target
domains. Finally, the inner product can be replaced with any kernel function.
The kernelization is summarized in Theorems 4.1 and 4.2 below. Although our
kernelization resembles that in [13], there are some differences: 1) our model
focuses on a metric, so its parameter matrix $M$ is a PSD matrix, while the
parameter matrix in [12] is an asymmetric rectangular matrix; 2) the
regularizer used in our model is the LogDet function, while it is the Frobenius
norm in [12].

In the following, we show in Theorem 4.1 that the range space of the matrix
parameter $M$ in equation (8) lies in the range space of $Q$.

Theorem 4.1 There exists an $(N_x + N_y)$-dimensional matrix $L \in S_+$ such
that the optimal solution $M^*$ to (8) has the form

$$M^* = Q K^{-1/2} L K^{-1/2} Q^T \qquad (12)$$

Proof Clearly $M \in S_+$, since LogDet is defined only on $S_+$. Let $Q_\perp$
consist of basis vectors spanning the null space of $Q^T$, i.e.,
$Q^T Q_\perp = 0$. Then $M$ can be decomposed into two parts as follows

$$M = Q L Q^T + Q_\perp L Q_\perp^T \qquad (13)$$

where $L$ is some PSD matrix. It is easy to show that the second term
$Q_\perp L Q_\perp^T$ has no influence on $d^2(x_i, y_j)$ and
$\mathrm{MMD}^2(\mathcal{X}, \mathcal{Y})$. Consequently, the only term in
equation (8) influenced by the second term is the LogDet term. Fortunately, the
LogDet term is determined only by the eigenvalues of $M$:

$$\mathrm{LogDet}(M, I) = \sum_i \sigma_i(M) - \sum_i \log \sigma_i(M) - \dim(M) \qquad (14)$$

where $\sigma_i(\cdot)$ is the $i$th largest eigenvalue. According to matrix
perturbation theory [52],
$\sigma_i(Q L Q^T) \le \sigma_i(Q L Q^T + Q_\perp L Q_\perp^T)$ when both
$Q L Q^T$ and $Q_\perp L Q_\perp^T$ are PSD matrices. Thus, to minimize the
objective in equation (8), we discard the $Q_\perp L Q_\perp^T$ term and let
$M^* = Q L Q^T$. Finally, by the change of variable
$L \leftarrow K^{1/2} L K^{1/2}$, we can write
$M^* = Q K^{-1/2} L K^{-1/2} Q^T$. □



Based on Theorem 4.1, we show in Theorem 4.2 that an equivalent optimization
problem can be derived that involves only the inner products defined in the
source and the target domains.

Theorem 4.2 If $L^*$ is the optimal solution to the following problem:

$$\begin{aligned}
\min_{L, t, \xi} \quad & \mathrm{LogDet}(L, I) + \lambda_1 \mathrm{LogDet}(t, t_0) + \lambda_2 \mathrm{LogDet}(\xi, \xi_0) \\
\text{s.t.} \quad & e_{ij}^T K^{1/2} L K^{1/2} e_{ij} \ge \xi_{ij} \quad \text{if } l_i^x \ne l_j^y \\
& e_{ij}^T K^{1/2} L K^{1/2} e_{ij} \le \xi_{ij} \quad \text{if } l_i^x = l_j^y \\
& e^T K^{1/2} L K^{1/2} e \le t
\end{aligned} \qquad (15)$$

then $M^* = Q K^{-1/2} L^* K^{-1/2} Q^T$.

Proof Note that

$$Q K^{-1/2} = \begin{bmatrix} X K_x^{-1/2} & 0 \\ 0 & Y K_y^{-1/2} \end{bmatrix} \qquad (16)$$

is an orthogonal matrix because both $X K_x^{-1/2}$ and $Y K_y^{-1/2}$ are
orthogonal matrices. Let the eigen-decomposition of $L$ be
$L = U \Sigma U^T$. Then
$M = (Q K^{-1/2} U) \Sigma (U^T K^{-1/2} Q^T)$, which is the
eigen-decomposition of $M$. Consequently $\sigma_i(M) = \sigma_i(L)$, which
means

$$\mathrm{LogDet}(M, I) = \mathrm{LogDet}(L, I) + \text{const} \qquad (17)$$

Also, by substituting $M = Q K^{-1/2} L K^{-1/2} Q^T$ into equations (1) and
(6), we have

$$d^2(x_i, y_j) = e_{ij}^T K^{1/2} L K^{1/2} e_{ij} \qquad (18)$$

$$\mathrm{MMD}^2(\mathcal{X}, \mathcal{Y}) = e^T K^{1/2} L K^{1/2} e \qquad (19)$$

By rewriting equation (8) using equations (17), (18) and (19), we obtain the
equivalent optimization problem (15), and also
$M^* = Q K^{-1/2} L^* K^{-1/2} Q^T$.
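
The identities used in this proof can be checked numerically for the linear kernel, where $K = Q^T Q$. The sketch below is illustrative only: the sizes are arbitrary assumptions chosen so that $K$ is invertible (fewer samples than dimensions in each domain), and `inv_sqrt_psd` is a hypothetical helper.

```python
import numpy as np

def inv_sqrt_psd(K):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(K)
    return (V / np.sqrt(w)) @ V.T          # V diag(w^{-1/2}) V^T

rng = np.random.default_rng(3)
D_x, D_y, N_x, N_y = 6, 7, 3, 4            # illustrative sizes (assumed)
N = N_x + N_y
X = rng.standard_normal((D_x, N_x))
Y = rng.standard_normal((D_y, N_y))

# Q = blockdiag(X, Y); with the linear kernel, K = Q^T Q = blockdiag(X^T X, Y^T Y).
Q = np.block([[X, np.zeros((D_x, N_y))],
              [np.zeros((D_y, N_x)), Y]])
K = Q.T @ Q
K_inv_sqrt = inv_sqrt_psd(K)
K_sqrt = np.linalg.inv(K_inv_sqrt)         # K^{1/2}

B = rng.standard_normal((N, N))
L = B @ B.T + np.eye(N)                    # an arbitrary positive definite L
P = Q @ K_inv_sqrt                         # has orthonormal columns
M = P @ L @ P.T                            # M = Q K^{-1/2} L K^{-1/2} Q^T

i, j = 1, 2
e_ij = np.zeros(N)
e_ij[i], e_ij[N_x + j] = 1.0, -1.0
z_ij = np.concatenate([X[:, i], -Y[:, j]])

d2_input = z_ij @ M @ z_ij                       # equation (1)
d2_kernel = e_ij @ K_sqrt @ L @ K_sqrt @ e_ij    # equation (18)
print(d2_input, d2_kernel)
```

The nonzero eigenvalues of `M` coincide with those of `L`, so the LogDet terms differ only by a constant, as stated in equation (17).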
5 Experiments



This section verifies the MLHD model experimentally through the comparison 
with some relevant baselines. We first visually demonstrate the idea of the 
MLHD model under two-dimensional source domain and three-dimensional tar- 
get domain. Then we conduct two experiments on the cross-language retrieval 
task and the heterogeneous domain object recognition task respectively. 

5.1 Baseline Methods 

• KCCA+NN

Because the source and the target domains lie in spaces of different
dimensionality, the nearest neighbor (NN) classifier cannot be applied
directly. Consequently, we follow the method used by Kulis et al. [12].
Specifically, we first apply kernel canonical correlation analysis (KCCA) to
project the samples of the two domains into a common space, and then run the NN
classifier.

• KCCA+ITML

Saenko et al. [19] proposed using the information-theoretic metric learning
(ITML) algorithm [7] to adapt the discrepancy between different domains.
However, ITML cannot work across spaces of different dimensionality, so KCCA is
applied first. This baseline is also taken from Kulis et al. [12].

• Asymmetric Regularized Cross-domain transformation (ARC)

This method was introduced by Kulis et al. [12]. It learns a linear asymmetric
transformation to compute a cross-domain similarity score, and its mathematical
formulation is as follows

$$\min_W \; \|W\|_F^2 + \lambda \left( \sum_{l_i^x = l_j^y} \max(0, \, l - x_i^T W y_j)^2 + \sum_{l_i^x \ne l_j^y} \max(0, \, x_i^T W y_j - u)^2 \right) \qquad (20)$$

ARC differs from MLHD in that 1) it learns a similarity rather than a metric,
and 2) it does not consider the alignment of priors, and thus fails to exploit
the abundant unlabeled samples.

5.2 Toy Problem 

To demonstrate the benefit of aligning both the priors and the posteriors of
the two domains, we construct a two-dimensional source domain and a
three-dimensional target domain, as depicted in Figure 2. Both have two
classes. In the source domain, each class has 40 labeled samples randomly drawn
from two Gaussian distributions. In the target domain, to demonstrate the
efficacy of aligning the priors, each class has 40 unlabeled samples and 2
labeled samples that are deliberately sampled with bias.
(Panels: 2 dimensional labeled source data; 2 dimensional partially-labeled source data)




Figure 2: Source and target domains 



Figure 3 shows the results of the three baselines and our algorithm. Clearly,
CCA does not map the samples well into the common space: the samples of each
class from both domains lie mainly in a strip, and the unlabeled samples are
completely separated from the labeled ones. Predictably, the nearest neighbor
classifier will have poor accuracy in this situation. ITML improves the
situation, but still fails to align the two domains well; the unlabeled samples
remain separated from the labeled ones. ARC uses only the labeled samples of
the two domains. Although the two classes of both the source and the target
domain are well separated, the distributions of the source and the target
domains are clearly very different. In contrast, besides aligning the
posteriors, MLHD explicitly forces the prior distributions to be aligned, and
the figure shows that the two classes of the source and the target domains
align well together. Note that the classes of the target domain roughly
concentrate at the center of the corresponding classes of the source domain.
They do not align evenly because, for optimization convenience, the MMD term in
the MLHD model uses a linear kernel rather than the required universal kernel.
As a result, only the means of the priors are aligned, not the priors
themselves.
Figure 3: Visualization on the toy problem



5.3 Experiments on multilingual Reuters dataset 

The dataset of multilingual Reuters is collected by Amini et al. [T0. It contains 
6 large Reuters categories (CCAT, C15, ECAT, E21, GCAT and M11) extracted from
collections in 5 different languages (English, French, German, Italian and
Spanish). Each document has been preprocessed and indexed using a standard
preprocessing chain, including removal of stopwords and low-frequency words,
and is then represented by TF-IDF features. For computational convenience, PCA
is first applied to reduce the dimension of the source domain to 100 and that
of the target domain to 150.

1 http://multilingreuters.iit.nrc.ca/ReutersMultiLingualMultiView.htm

Twenty groups are constructed by picking any two languages as a group, one as
the source domain and the other as the target domain. For each group, ten
trials are carried out, and the average accuracy and the standard deviation are
reported. For each trial, 20 labeled samples per class are randomly chosen from
the source domain and 20 unlabeled plus 1 labeled samples per class are
randomly chosen from the target domain as the training set, and 300 samples are
randomly chosen from the target domain as the testing set. In all of the
trials, the RBF kernel is used.
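
The counts in this protocol follow from simple enumeration, as the short sketch below illustrates. The two-letter language codes are taken from the row labels of Table 1; the variable names are illustrative assumptions.

```python
from itertools import permutations

languages = ["EN", "FR", "GR", "IT", "SP"]
n_classes = 6                                  # six Reuters categories

# Ordered (source, target) pairs of distinct languages.
groups = [f"{src}-{tgt}" for src, tgt in permutations(languages, 2)]

# Per-trial training set sizes implied by the protocol.
n_source_labeled = 20 * n_classes
n_target_unlabeled = 20 * n_classes
n_target_labeled = 1 * n_classes

print(len(groups), groups[:4])
print(n_source_labeled, n_target_unlabeled, n_target_labeled)
```

Each ordered pair counts separately because swapping the source and the target changes which language supplies the labels.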

From the comparative results shown in Table 1, we can see that the baseline
KCCA+NN, which simply uses the Euclidean distance, yields the poorest
performance. The baseline KCCA+ITML is much better than KCCA+NN, which implies
that the common space produced by KCCA alone is not well suited to the task.
The ARC algorithm does not require the samples of the two domains to lie in
spaces of the same dimensionality, and thus does not need KCCA as a
preprocessing step for dimension reduction. The results of ARC are generally
better than those of KCCA+ITML. Compared with ARC, MLHD further aligns the
priors and outperforms ARC overall, which confirms the benefit of aligning the
priors.



Src-Tgt   KCCA+NN    KCCA+ITML  ARC        MLHD
EN-FR     19.3±4.5   46.4±3.7   45.8±6.0   46.6±4.0
EN-GR     18.3±3.0   42.6±6.5   43.5±7.0   45.0±4.0
EN-IT     17.6±1.2   38.5±7.4   40.0±6.0   40.0±4.9
EN-SP     22.3±5.6   41.6±7.4   42.5±7.5   44.1±5.6

FR-EN     20.1±3.5   36.6±8.3   45.1±6.8   46.3±5.4
FR-GR     17.6±2.1   36.3±8.1   40.3±7.3   40.0±7.8
FR-IT     18.2±2.4   32.5±6.3   36.6±6.3   35.8±6.1
FR-SP     17.8±2.7   34.7±6.6   37.6±8.3   39.1±8.1

GR-EN     18.5±4.4   34.3±7.3   36.0±3.9   38.2±4.9
GR-FR     17.5±1.1   41.7±8.6   44.1±6.9   43.1±6.7
GR-IT     18.3±3.0   33.7±7.9   36.6±5.7   37.7±6.7
GR-SP     19.7±2.9   42.9±8.0   43.5±7.3   45.7±6.4

IT-EN     19.0±4.4   36.3±6.9   39.8±3.8   42.4±4.8
IT-FR     18.7±3.0   37.6±8.3   41.2±7.0   44.6±6.6
IT-GR     20.5±7.2   39.9±8.8   43.0±8.2   45.6±9.3
IT-SP     17.2±0.6   36.1±7.8   37.6±6.2   41.6±8.3

SP-EN     18.4±2.4   35.1±6.7   42.3±8.4   42.8±9.3
SP-FR     17.7±1.4   41.8±9.6   43.3±4.7   44.8±6.2
SP-GR     18.3±2.9   35.6±8.8   43.9±8.2   43.9±7.6
SP-IT     17.8±1.5   37.4±3.4   38.3±4.5   40.0±6.4

Table 1: Accuracy results of cross-language retrieval. The best performances
are highlighted.

The MLHD model has two important trade-off parameters: $\lambda_1$, which
weights the MMD term, and $\lambda_2$, which weights the distance constraint
term. To study how these parameters influence the performance of MLHD, we run
the experiments with $\lambda_1$ taken from
$[10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}]$ and $\lambda_2$ taken
from $[10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}]$. The results are
shown in Figure 4. From this figure, we can observe that in general MLHD is not
very sensitive to the parameter configuration, especially to $\lambda_2$.
Moreover, the accuracy increases as $\lambda_1$ increases, which again verifies
the usefulness of aligning the priors.



Figure 4: Parameter study on Multilingual Reuters dataset

5.4 Object Recognition Experiments

In this subsection, we use the dataset provided by Saenko et al. [19]. This
dataset contains 3 image domains: Amazon (images collected from Amazon.com),
DSLR (high-resolution images taken with a digital SLR camera) and Webcam
(low-resolution images taken with a web camera). Among them, images in the
Amazon domain are in a canonical pose with a uniform background, while images
in both the DSLR and Webcam domains are taken with varying poses and
backgrounds. Thus, in these experiments, we use DSLR and Webcam as the target
domain separately. We follow the setting of Kulis et al. [12] to extract image
features. Specifically, all the images are first resized to 300x300 resolution.
Then, for each domain, three types of features are extracted:

• SURF600 SURF [2] features are extracted and clustered into 600 visual
words. Each image is then represented by a 600-dimensional histogram.

• SURF800 Same processing as SURF600, except that the features are clustered
into 800 visual words.

• SIFT900 SIFT [16] features are extracted and clustered into 900 visual
words. Each image is then represented by a 900-dimensional histogram.
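
The histogram representation described in the list above can be sketched as follows. This is a generic bag-of-visual-words illustration and not the exact feature pipeline used in the experiments: the codebook here is random rather than learned by clustering, the 64-dimensional descriptor size is an assumption, and the text does not state whether histograms are normalized.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Count how many local descriptors fall on each visual word.

    descriptors: (n, d) array of local features from one image.
    codebook:    (k, d) array of visual words (cluster centres).
    Returns a length-k integer histogram.
    """
    # Squared Euclidean distance from every descriptor to every word.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    words = dist2.argmin(axis=1)               # nearest visual word per descriptor
    return np.bincount(words, minlength=codebook.shape[0])

rng = np.random.default_rng(4)
codebook = rng.standard_normal((600, 64))      # stand-in for 600 learned words
descriptors = rng.standard_normal((50, 64))    # stand-in for one image's features
hist = bow_histogram(descriptors, codebook)
print(hist.shape, hist.sum())
```

Changing the codebook size from 600 to 800 or 900 changes the feature dimension, which is how the experiments obtain domains with different dimensionalities from the same images.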

We use the images with SURF600 and SIFT900 features as the source domain
respectively, and construct the 16 groups of experiments listed in Table 2. The
experimental settings are almost the same as those in the above experiment on
the multilingual dataset. Specifically, ten trials are run for each group. For
each trial, 20 labeled samples per class are randomly chosen from the source
domain. In the target domain, 10 unlabeled plus 1 labeled samples per class are
randomly chosen as the training set, and the rest constitute the test set. We
do not use many unlabeled samples in the target domain because of the limited
number of collected images in the DSLR and Webcam domains. In all of the
trials, the RBF kernel is again used.

The experimental results are listed in Table 2.



Src-Tgt                        CCA+NN     CCA+ITML   ARC        MLHD
AmazonSurf600-WebcamSift900    15.7±2.6   29.7±2.6   30.1±3.2   31.1±2.9
AmazonSurf600-WebcamSurf800    19.0±3.9   29.0±3.1   32.1±2.0   30.6±2.6
AmazonSift900-WebcamSurf600    19.8±2.7   29.9±2.0   31.9±2.8   30.6±3.0
AmazonSift900-WebcamSurf800    18.7±4.4   30.1±4.6   31.8±2.8   31.4±3.5
DslrSurf600-WebcamSift900      16.1±1.9   29.1±1.8   29.8±1.9   31.8±2.7
DslrSurf600-WebcamSurf800      14.6±3.2   28.4±3.9   31.4±3.2   31.6±2.7
DslrSift900-WebcamSurf600      13.1±2.4   27.4±3.7   29.9±1.9   29.4±1.3
DslrSift900-WebcamSurf800      14.5±3.4   25.8±4.4   29.3±2.5   28.1±2.6
AmazonSurf600-DslrSift900      20.6±5.6   28.8±5.1   27.8±4.5   29.3±4.1
AmazonSurf600-DslrSurf800      16.0±4.4   25.4±5.2   27.8±4.1   23.5±4.3
AmazonSift900-DslrSurf600      14.7±4.9   25.4±4.0   26.5±3.8   26.5±4.6
AmazonSift900-DslrSurf800      11.3±4.4   22.6±4.0   25.5±4.4   25.5±4.6
WebcamSurf600-DslrSift900      13.3±3.6   27.8±2.6   27.4±4.6   29.8±3.5
WebcamSurf600-DslrSurf800      18.5±5.9   27.7±4.6   28.4±3.0   31.3±3.9
WebcamSift900-DslrSurf600      13.1±2.4   25.2±5.6   27.5±4.1   28.6±2.9
WebcamSift900-DslrSurf800      14.5±3.4   25.6±4.5   26.9±4.3   26.7±4.3


Table 2: Object recognition accuracy. The best performances are highlighted. 

According to Table 2, we again observe the ineffectiveness of CCA+NN and the
accuracy improvement brought by an additional ITML metric learning step.
Moreover, ARC outperforms CCA+ITML on almost all groups. However, the
superiority of MLHD over ARC is not very significant in this experiment.
Although MLHD generally performs better than ARC on the groups whose target
domain is DSLR, the two algorithms perform comparably on the groups whose
target domain is Webcam. The reason may be that there are not enough training
samples in the target domain in this experiment, owing to the limited number of
collected images: only 11 samples per class are used. Consequently, the priors
may be aligned with bias given such a small number of training samples.

6 Conclusion 

In this paper, we proposed the MLHD model to learn a metric defined across
heterogeneous source and target domains, a task that, to the best of our
knowledge, is seldom touched. The proposed model aligns both the priors and the
posteriors of the source and the target domains at the same time. We showed
that our model can be reparameterized with a single PSD matrix, and used a
LogDet function to regularize the model for convenience of optimization. We
then gave an optimization method based on the Bregman projection method, and
showed that the model can be easily kernelized by solving an equivalent
optimization problem. Finally, we validated its effectiveness on the
multilingual retrieval task and the object recognition task under various
settings.